Abstract.

In this Note we introduce a new methodology for Bayesian inference through the use
of $\phi$-divergences and the duality technique. The asymptotic laws of the estimates are
established.
1 Introduction

Bayesian techniques are particularly attractive since they can incorporate information
other than the data into the model in the form of prior distributions. Another feature
that makes them increasingly attractive is that they can handle models that are difficult
to estimate with classical methods by use of simulation techniques; see for instance
Robert (2001).

The aim of this Note is to discuss the use of divergences as a basis for Bayesian
inference. The use of divergence measures in a Bayesian context has been considered in
Dey and Birmiwal (1994) and Peng and Dey (1995). Most of this work is concerned with
the use of divergence measures to study Bayesian robustness when the priors are
contaminated and to diagnose the effect of outliers.

In order to estimate model parameters and circumvent possible difficulties encountered
with the likelihood function, we follow up common robustification ideas, see for instance
Hanousek (1990, 1994), and propose to replace the likelihood in the formula of the
posterior distribution by the dual form of the divergence. This leads to estimators that
are both robust and efficient and that include the expected a posteriori estimator (EAP)
as a benchmark. A major advantage of the method is that it does not require additional
accessories such as kernel density estimation or other forms of nonparametric smoothing
to produce nonparametric density estimates of the true underlying density function, in
contrast with the method proposed by Hooker and Vidyashankar (2011), which is based on
the concept of a minimum disparity procedure introduced by Lindsay (1994). The plug-in of
the empirical distribution function is sufficient for the purpose of estimating the
divergence in the case of i.i.d. data. The proposed estimators are based on integration
rather than optimization. Other reasons commonly put forward for the proposed approach
are its computational attractiveness through the use of MCMC and its ability to handle a
large number of parameters.

The outline of the Note is as follows. Together with a brief review of definitions and 
properties of divergences, Section 2 discusses the procedure to obtain the estimates. In 
Section 3, we give the limit laws of the proposed estimators. Some final remarks conclude 
the Note. 

2 Estimation 

2.1 Background on dual divergences inference 

Keziou (2003) and Broniatowski and Keziou (2009) introduced the class of dual
divergence estimators for general parametric models. In the following, we briefly recall
their context and definition.

Recall that the $\phi$-divergence between a bounded signed measure $Q$ and a probability
measure $P$ on the same measurable space, when $Q$ is absolutely continuous with respect
to $P$, is defined by

$$D_\phi(Q, P) := \int \phi\!\left(\frac{dQ}{dP}(x)\right) dP(x),$$

where $\phi$ is a convex function from $]-\infty, \infty[$ to $[0, \infty]$ with $\phi(1) = 0$.
Different choices for $\phi$ have been proposed in the literature. For a good overview, see
Pardo (2006). A well-known class of divergences is the class of the so-called "power
divergences" introduced in Cressie and Read (1984) (see also Liese and Vajda (1987),
chapter 2); it contains the most widely known and used divergences. They are defined
through the class of convex functions

$$x \in\, ]0, +\infty[\ \mapsto\ \phi_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma - 1)} \tag{1}$$

if $\gamma \in \mathbb{R} \setminus \{0, 1\}$, $\phi_0(x) := -\log x + x - 1$ and $\phi_1(x) := x \log x - x + 1$.
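
As a small illustration of definition (1), the generators of the power divergences can be evaluated directly. The sketch below is not part of the original Note; the function name `phi_gamma` is a hypothetical choice, and the two limiting cases $\gamma = 0$ and $\gamma = 1$ are handled separately as in the text.

```python
import math

def phi_gamma(x: float, gamma: float) -> float:
    """Power-divergence generator phi_gamma(x) of (1), for x > 0 (illustrative helper)."""
    if x <= 0:
        raise ValueError("x must be positive")
    if gamma == 0:   # phi_0: modified Kullback-Leibler (KL_m) generator
        return -math.log(x) + x - 1.0
    if gamma == 1:   # phi_1: Kullback-Leibler (KL) generator
        return x * math.log(x) - x + 1.0
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

# gamma = 2 reduces algebraically to (x - 1)^2 / 2,
# gamma = 1/2 reduces to 2 * (sqrt(x) - 1)^2.
values_at_one = [phi_gamma(1.0, g) for g in (-1.0, 0.0, 0.5, 1.0, 2.0)]
```

Every generator vanishes at $x = 1$, as required by $\phi(1) = 0$, and $\phi_\gamma$ approaches $\phi_0$ as $\gamma \to 0$.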
Let $X_1, \ldots, X_n$ be an i.i.d. sample with probability measure $P_{\theta_0}$. Consider
the problem of estimating the population parameters of interest $\theta_0$, when the
underlying identifiable model is given by $\{P_\theta : \theta \in \Theta\}$ with $\Theta$ a
subset of $\mathbb{R}^d$. Here the attention is restricted to the case where the probability
measures $P_\theta$ are absolutely continuous with respect to the same $\sigma$-finite
measure $\lambda$; the corresponding densities are denoted $p_\theta$.

Let $\phi$ be a function of class $C^2$, strictly convex and satisfying

$$\int \left| \phi'\!\left(\frac{p_\theta(x)}{p_\alpha(x)}\right) \right| p_\theta(x)\, dx < \infty. \tag{2}$$



By Lemma 3.2 in Broniatowski and Keziou (2006), if the function $\phi$ satisfies: there
exists $0 < \eta < 1$ such that for all $c$ in $[1 - \eta, 1 + \eta]$, we can find numbers
$c_1, c_2, c_3$ such that

$$\phi(cx) \le c_1 \phi(x) + c_2 |x| + c_3, \quad \text{for all real } x, \tag{3}$$



then the assumption (2) is satisfied whenever $D_\phi(P_\theta, P_\alpha)$ is finite. From
now on, $\mathcal{U}$ will be the set of $\theta$ and $\alpha$ such that
$D_\phi(P_\theta, P_\alpha) < \infty$. Note that all the real convex functions $\phi_\gamma$
pertaining to the class of power divergences defined in (1) satisfy the condition (3).
Under (2), using the Fenchel duality technique, the divergence $D_\phi(\theta, \theta_0)$
can be represented as the result of an optimization procedure; this elegant result was
proven in Keziou (2003), Liese and Vajda (2006) and Broniatowski and Keziou (2009).
Broniatowski and Keziou (2006) called it the dual form of a divergence, due to its
connection with convex analysis.
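Under the above conditions, the $\phi$-divergence

$$D_\phi(P_\theta, P_{\theta_0}) = \int \phi\!\left(\frac{p_\theta(x)}{p_{\theta_0}(x)}\right) p_{\theta_0}(x)\, dx$$

can be represented in the following form:

$$D_\phi(P_\theta, P_{\theta_0}) = \sup_{\alpha \in \mathcal{U}} \int h(\theta, \alpha)\, dP_{\theta_0}, \tag{4}$$

where $h(\theta, \alpha) : x \mapsto h(\theta, \alpha, x)$ and

$$h(\theta, \alpha, x) := \int \phi'\!\left(\frac{p_\theta}{p_\alpha}\right) dP_\theta - \left[ \frac{p_\theta(x)}{p_\alpha(x)}\, \phi'\!\left(\frac{p_\theta(x)}{p_\alpha(x)}\right) - \phi\!\left(\frac{p_\theta(x)}{p_\alpha(x)}\right) \right]. \tag{5}$$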



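Since the supremum in (4) is unique and is attained at $\alpha = \theta_0$, independently
of the value of $\theta$, replacing the hypothetical probability measure $P_{\theta_0}$ by
the empirical measure $P_n$ defines the class of estimators of $\theta_0$ by

$$\widehat{\alpha}_\phi(\theta) := \arg\sup_{\alpha \in \mathcal{U}} \int h(\theta, \alpha)\, dP_n, \quad \theta \in \Theta, \tag{6}$$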



where $h(\theta, \alpha)$ is the function defined in (5). This class is called "dual
$\phi$-divergence estimators" (D$\phi$DE's), see for instance Keziou (2003) and
Broniatowski and Keziou (2009). Formula (6) defines a family of M-estimators indexed by
the function $\phi$ specifying the divergence and by some instrumental value of the
parameter $\theta$, called here the escort parameter; see also Broniatowski and Vajda (2009).
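
To make (5) and (6) concrete, the following sketch evaluates the empirical criterion $\int h(\theta, \alpha)\, dP_n$ and maximizes it over a grid. It is only an illustration under stated assumptions: the model is taken to be the normal location family $N(\alpha, 1)$, the integral in (5) is approximated by a Riemann sum, the arg sup is replaced by a grid search, and all function names are hypothetical. For the power divergences, $\phi_\gamma'(r) = (r^{\gamma-1} - 1)/(\gamma - 1)$ and $r\phi_\gamma'(r) - \phi_\gamma(r) = (r^\gamma - 1)/\gamma$, with the logarithmic limits at $\gamma = 1$ and $\gamma = 0$.

```python
import numpy as np

def density_ratio(x, theta, alpha):
    # p_theta(x) / p_alpha(x) for unit-variance normal densities, in closed form
    return np.exp((theta - alpha) * x - 0.5 * (theta**2 - alpha**2))

def phi_prime(r, gamma):
    # derivative of the power-divergence generator phi_gamma
    if gamma == 1:
        return np.log(r)
    return (r ** (gamma - 1.0) - 1.0) / (gamma - 1.0)

def psi(r, gamma):
    # psi(r) = r * phi'(r) - phi(r), the bracketed term in (5)
    if gamma == 0:
        return np.log(r)
    return (r**gamma - 1.0) / gamma

def dual_criterion(theta, alpha, sample, gamma, grid):
    # P_n h(theta, alpha); the integral in (5) is approximated by a Riemann sum
    dx = grid[1] - grid[0]
    p_theta = np.exp(-0.5 * (grid - theta) ** 2) / np.sqrt(2.0 * np.pi)
    integral = np.sum(phi_prime(density_ratio(grid, theta, alpha), gamma) * p_theta) * dx
    return integral - np.mean(psi(density_ratio(sample, theta, alpha), gamma))

def dual_phi_estimate(theta, sample, gamma, alphas, grid):
    # grid search standing in for the arg sup in (6)
    values = np.array([dual_criterion(theta, a, sample, gamma, grid) for a in alphas])
    return float(alphas[int(np.argmax(values))])

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=1.0, size=200)   # simulated data, true value 1
grid = np.linspace(-12.0, 14.0, 5201)               # integration grid
alphas = np.linspace(-1.0, 3.0, 401)                # candidate values of alpha
alpha_klm = dual_phi_estimate(0.5, sample, 0.0, alphas, grid)  # gamma = 0 (KL_m)
alpha_hel = dual_phi_estimate(0.5, sample, 0.5, alphas, grid)  # gamma = 1/2
```

With $\gamma = 0$ the criterion reduces to the average log-likelihood ratio, so in this model the grid maximizer sits at the grid point nearest the sample mean, whatever the escort value.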

Applications of the dual representation of $\phi$-divergences have been considered by
many authors. We cite, among others, Keziou and Leoni-Aubin (2008) for semi-parametric
two-sample density ratio models and Toma and Leoni-Aubin (2010) for robust tests based on
saddlepoint approximations. Toma and Broniatowski (2010) proved that this class contains
robust and efficient estimators and proposed robust test statistics based on divergence
estimators. Bootstrapped $\phi$-divergence estimates are considered in Bouzebda and
Cherfi (2011), an extension of dual $\phi$-divergence estimators to right censored data
is introduced in Cherfi (2011a), and for estimation and tests in copula models we refer
to Bouzebda and Keziou (2010) and the references therein. The performance of dual
$\phi$-divergence estimators for normal models is studied in Cherfi (2011b).



2.2 Estimation 

Let us now turn to the estimation using divergences in our setting. For the parameter
$\theta$ consider a prior density $\pi(\theta)$ on $\Theta$, and let $\rho(x, \theta)$,
defined on $\mathbb{R} \times \Theta$, be a suitable function. Then Hanousek (1990)
considered the following Bayes-type or B-estimator of $\theta_0$, corresponding to the
prior density $\pi(\theta)$ and generated by the function $\rho(x, \theta)$,

$$\theta^*_n = \frac{\int_\Theta \theta \exp\left\{-\sum_{i=1}^n \rho(X_i, \theta)\right\} \pi(\theta)\, d\theta}{\int_\Theta \exp\left\{-\sum_{i=1}^n \rho(X_i, \theta)\right\} \pi(\theta)\, d\theta} \tag{7}$$

if both integrals exist. Estimators of this type are often called Laplace type estimators;
see for instance Chernozhukov and Hong (2003).
The posterior M-estimator is defined as

$$\theta^+_n = \arg\max_{\theta} \left( -\sum_{i=1}^n \rho(X_i, \theta) + \ln \pi(\theta) \right). \tag{8}$$



Hanousek (1990) showed that $\theta^*_n$ is asymptotically equivalent to the M-estimator
generated by $\rho$ for a large class of priors and under some conditions on $\rho$ and
$P_{\theta_0}$. The asymptotic equivalence provides access to the study of the asymptotics
of B-estimators via the M-estimators.

In the context of the Bayesian methods examined in this Note, instead of a likelihood
function, our work will use the criterion function $P_n h(\theta, \alpha) := \int h(\theta, \alpha)\, dP_n$.
Inference is based on the $\phi$-posterior

$$p_{\phi,n}(\alpha \mid X_1, \ldots, X_n) = \frac{\exp\{n P_n h(\theta, \alpha)\}\, \pi(\alpha)}{\int_{\mathcal{U}} \exp\{n P_n h(\theta, \alpha)\}\, \pi(\alpha)\, d\alpha}. \tag{9}$$

A risk function is the expected loss or error that the researcher incurs when choosing
a certain value for the parameter estimate. Let $\mathcal{L}_n(u)$ be a loss function. The
risk function takes the form

$$\mathcal{R}_n(\tilde{\alpha}) = \int_{\mathcal{U}} \mathcal{L}_n(\alpha - \tilde{\alpha})\, p_{\phi,n}(\alpha \mid X_1, \ldots, X_n)\, d\alpha, \tag{10}$$

where $p_{\phi,n}(\alpha \mid X_1, \ldots, X_n)$ is the $\phi$-posterior density,
$\tilde{\alpha}$ is the selected value, and $\alpha$ ranges over all other possible values
we are integrating over. The loss function can penalize the selection of $\tilde{\alpha}$
asymmetrically, and is a function of the selected value and the rest of the possible
values of the parameters in $\mathcal{U}$.

The dual $\phi$-divergence Bayes type estimator minimizes the expected loss for different
forms of the loss function

$$\alpha^*_\phi(\theta) = \arg\inf_{\tilde{\alpha}} \mathcal{R}_n(\tilde{\alpha}). \tag{11}$$

Choosing different loss functions will change the objective function, so that the
estimators bear different interpretations. For instance, when the loss is squared error
($\mathcal{L}_n(u) = |\sqrt{n}\, u|^2$), for fixed $\theta$, the dual $\phi$-divergence
Bayes type estimator is defined as

$$\alpha^*_\phi(\theta) = \int_{\mathcal{U}} \alpha\, p_{\phi,n}(\alpha \mid X_1, \ldots, X_n)\, d\alpha := \frac{\int_{\mathcal{U}} \alpha \exp\{n P_n h(\theta, \alpha)\}\, \pi(\alpha)\, d\alpha}{\int_{\mathcal{U}} \exp\{n P_n h(\theta, \alpha)\}\, \pi(\alpha)\, d\alpha}, \tag{12}$$



if both integrals exist. Other familiar forms, obtained for different loss functions, are
modes, medians and quantiles.
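
For a scalar parameter, the ratio of integrals in (12) can be approximated on a uniform grid. The sketch below is an illustration only, with hypothetical names throughout: it assumes the $N(\alpha, 1)$ location model, a $N(0, \tau^2)$ prior, and the choice $\phi(x) = -\log x + x - 1$, for which $h(\theta, \alpha, x) = -\log(p_\theta(x)/p_\alpha(x))$. In that conjugate setting the ordinary posterior mean is available in closed form, which gives a check on the grid computation.

```python
import numpy as np

def bayes_type_estimate(alphas, n_times_criterion, prior):
    """Grid approximation of (12): the mean of the phi-posterior (uniform grid assumed)."""
    log_w = n_times_criterion + np.log(prior)
    w = np.exp(log_w - np.max(log_w))   # subtract the maximum for numerical stability
    return float(np.sum(alphas * w) / np.sum(w))

def klm_criterion(theta, alphas, sample):
    """P_n h(theta, alpha) for the KL_m choice in the N(alpha, 1) model (assumed setup)."""
    # h(theta, alpha, x) = log p_alpha(x) - log p_theta(x); normalising constants cancel
    log_p_alpha = -0.5 * (sample[None, :] - alphas[:, None]) ** 2
    log_p_theta = -0.5 * (sample - theta) ** 2
    return np.mean(log_p_alpha - log_p_theta[None, :], axis=1)

rng = np.random.default_rng(1)
sample = rng.normal(1.0, 1.0, size=50)
n = sample.size
alphas = np.linspace(-4.0, 6.0, 4001)
tau2 = 4.0
prior = np.exp(-0.5 * alphas**2 / tau2)   # N(0, tau2) prior, unnormalised

est_a = bayes_type_estimate(alphas, n * klm_criterion(0.0, alphas, sample), prior)
est_b = bayes_type_estimate(alphas, n * klm_criterion(2.5, alphas, sample), prior)
closed_form = n * sample.mean() / (n + 1.0 / tau2)   # conjugate posterior mean
```

The two escort values give the same estimate because, for this divergence, the escort parameter only contributes a factor that cancels between numerator and denominator.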

The posterior dual $\phi$-divergence estimator is defined as

$$\alpha^+_\phi(\theta) = \arg\sup_{\alpha} \left( P_n h(\theta, \alpha) + \ln \pi(\alpha) \right). \tag{13}$$

It is obvious that posterior dual $\phi$-divergence estimates naturally inherit the
properties of dual $\phi$-divergence estimates, and hence we focus on dual
$\phi$-divergence Bayes type estimators only.

Remark 1 1. The EAP estimator, which is the mean of the posterior distribution, belongs
to the class of estimates (12). Indeed, it is obtained when $\phi(x) = -\log x + x - 1$,
that is, as the dual modified $KL_m$-divergence estimate. Observe that
$\phi'(x) = -\frac{1}{x} + 1$ and $x\phi'(x) - \phi(x) = \log x$, hence

$$\int h(\theta, \alpha)\, dP_n = -\int \log\!\left(\frac{p_\theta}{p_\alpha}\right) dP_n.$$

Keeping in mind definition (12), we get

$$\alpha^*_{KL_m}(\theta) = \frac{\int_{\mathcal{U}} \alpha \prod_{i=1}^n p_\alpha(X_i)\, \pi(\alpha)\, d\alpha}{\int_{\mathcal{U}} \prod_{i=1}^n p_\alpha(X_i)\, \pi(\alpha)\, d\alpha},$$

independently of $\theta$.



2. If new data $X_{n+1}, \ldots, X_N$ are obtained, the posterior for the combined data
$X_1, \ldots, X_N$ can be obtained by using the posterior after $n$ observations,
$p_{\phi,n}(\alpha \mid X_1, \ldots, X_n)$, as a prior for $\alpha$:

P0,n(a|-'^1, • • • i-'^Tv) '^V<i>,n{pi\Xx, - ■ ■ ,X„) X , • • • ,XAr|a). 



3 Asymptotic properties 



In this section we state the asymptotic normality of the estimates based on the
$\phi$-posterior and evaluate their limiting variance. The hypotheses handled here are
similar to those used in Keziou (2003) and Broniatowski and Keziou (2009) in the
frequentist case; these conditions are mild and can be satisfied in most circumstances.
From now on, $\xrightarrow{d}$ denotes convergence in distribution.

(R.1) 



sup 

(R.2) There exists a neighborhood $N(\theta_0)$ of $\theta_0$ such that the first and
second order partial derivatives (w.r.t. $\alpha$) of
$\phi'\!\left(\frac{p_\theta(x)}{p_\alpha(x)}\right) p_\theta(x)$ are dominated on
$N(\theta_0)$ by some integrable functions. The third order partial derivatives
(w.r.t. $\alpha$) of $h(\theta, \alpha, x)$ are dominated on $N(\theta_0)$ by some
$P_{\theta_0}$-integrable functions.



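Let

$$S := -P_{\theta_0} \frac{\partial^2}{\partial\alpha^2} h(\theta, \theta_0) \quad \text{and} \quad V := P_{\theta_0} \frac{\partial}{\partial\alpha} h(\theta, \theta_0)^\top \frac{\partial}{\partial\alpha} h(\theta, \theta_0).$$

Observe that the matrix $S$ is symmetric and positive since the second derivative $\phi''$
is nonnegative by the convexity of $\phi$.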

(R.3) The matrices S and V are non singular. 






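For $\alpha$ in an open neighborhood of $\theta_0$, using (R.2), a Taylor expansion gives

$$P_n h(\theta, \alpha) - P_n h(\theta, \theta_0) = (\alpha - \theta_0)^\top U_n(\theta_0) - \frac{1}{2} (\alpha - \theta_0)^\top S (\alpha - \theta_0) + R_n(\alpha), \tag{14}$$

where $U_n(\theta_0) := P_n \frac{\partial}{\partial\alpha} h(\theta, \theta_0)$.

(R.4) Given any $\epsilon > 0$, there exists $\delta > 0$ such that the probability of the event

$$\sup_{|\alpha - \theta_0| \le \delta} |R_n(\alpha)| > \epsilon \tag{15}$$

tends to zero as $n \to \infty$.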

Remark 2 1. Using Example 19.8 in van der Vaart (1998), it is clear that the class of
functions $\{\alpha \mapsto h(\theta, \alpha);\ \alpha \in \Theta\}$ is a Glivenko-Cantelli
class of functions for every fixed $\theta$, so that (R.1) holds.

2. Conditions (R.2) and (R.3) are the usual regularity properties of the underlying
model. They guarantee that we can interchange integration and differentiation, and that
the variance-covariance matrices exist; they are similar to the regularity conditions
used in Keziou (2003) and Broniatowski and Keziou (2009) in the frequentist case.

3. Condition (R.4) easily holds when there is enough smoothness. It requires that
the remainder term of the expansion can be controlled in a particular way over a
neighborhood of $\theta_0$.

Define

$$t := \sqrt{n}\,(\alpha - A_n), \qquad A_n := \theta_0 + S^{-1} U_n(\theta_0), \tag{16}$$

and let $p^*_{\phi,n}(t)$ be the $\phi$-posterior density of $t$.

The following theorem states that under some regularity conditions, for large $n$,
$p^*_{\phi,n}(\cdot)$ is approximately a random normal density in the $L_1$ sense.



Theorem 1 Let $\pi(\theta)$ be any prior that is continuous and positive at $\theta_0$ with
$\int |\theta|\, \pi(\theta)\, d\theta < \infty$. Then under Conditions (R.1-4)



dt A 0. (17) 



We now state the principal result of this section. Theorem 2 is concerned with the
efficiency and asymptotic normality of the proposed estimates. See Ibragimov and
Has'minskii (1981) and Strasser (1981) for more on the consistency and efficiency of
Bayes estimators.

Theorem 2 Let $\pi(\theta)$ be any prior that is continuous and positive at $\theta_0$ with
$\int |\theta|\, \pi(\theta)\, d\theta < \infty$. Assume that Conditions (R.1-4) hold, then
as $n$ tends to infinity

$$V^{-1/2} S \sqrt{n}\,(\alpha^*_\phi(\theta) - \theta_0) \xrightarrow{d} \mathcal{N}(0, I).$$

Remarks If $\theta = \theta_0$, then $S V^{-1} S = I_{\theta_0}$, the information matrix, so
that $\alpha^*_\phi(\theta_0)$ is consistent and asymptotically efficient. The consequence
is that the value of the escort parameter should be taken as a consistent estimator of
$\theta_0$; see Cherfi (2011a,b) for relevant discussion on this subject.
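
The identity used in the remark above can be checked by a short computation. The following is a sketch rather than part of the original argument: it assumes enough smoothness to differentiate under the integral sign, writes $s_\alpha := \partial_\alpha \log p_\alpha$ for the score (a notation introduced here), and uses $I_{\theta_0} = \int s_{\theta_0} s_{\theta_0}^\top \, dP_{\theta_0}$.

```latex
% Differentiating (5) with respect to alpha, using
% \partial_\alpha (p_\theta / p_\alpha) = -(p_\theta / p_\alpha) s_\alpha:
\[
\frac{\partial}{\partial\alpha} h(\theta,\alpha,x)
 = -\int \phi''\!\Big(\frac{p_\theta}{p_\alpha}\Big)\frac{p_\theta}{p_\alpha}\, s_\alpha \, dP_\theta
 + \Big(\frac{p_\theta(x)}{p_\alpha(x)}\Big)^{2}
   \phi''\!\Big(\frac{p_\theta(x)}{p_\alpha(x)}\Big)\, s_\alpha(x).
\]
% At \theta = \alpha = \theta_0 the ratio equals 1 and \int s_{\theta_0}\, dP_{\theta_0} = 0, so
\[
\frac{\partial}{\partial\alpha} h(\theta_0,\theta_0,x) = \phi''(1)\, s_{\theta_0}(x),
\qquad
V = \phi''(1)^{2}\, I_{\theta_0}.
\]
% A second differentiation, with \int \dot{s}_{\theta_0}\, dP_{\theta_0} = -I_{\theta_0},
% gives S = \phi''(1)\, I_{\theta_0}, and therefore
\[
S V^{-1} S
 = \phi''(1) I_{\theta_0}\,\big(\phi''(1)^{2} I_{\theta_0}\big)^{-1}\,\phi''(1) I_{\theta_0}
 = I_{\theta_0}.
\]
```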



4 Concluding remarks 



We have introduced a new estimation procedure in parametric models that combines the
divergence method with Bayesian analysis; it generalizes the expected a posteriori
estimate. The proposed estimators are based on integration rather than optimization.
These estimators are often much easier to compute in practice than the arg sup
estimators (6), especially in the high-dimensional setting; see, for example, the
discussion in Liu et al. (2008).

In order to compute these estimators using Markov chain Monte Carlo methods, we can
draw a Markov chain,

$$S = \left( \alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(B)} \right), \tag{18}$$

whose marginal density is approximately given by $p_{\phi,n}(\cdot)$, the $\phi$-posterior
distribution. Then the estimate $\alpha^*_\phi(\theta)$ is computed as

$$\widehat{\alpha}^*_\phi(\theta) = \frac{1}{B} \sum_{i=1}^{B} \alpha^{(i)}. \tag{19}$$
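
The chain in (18) can be produced by any standard sampler. As one possible illustration, the sketch below uses a random-walk Metropolis sampler, which is a choice made here and not prescribed by the Note. It targets the $\phi$-posterior for the $N(\alpha, 1)$ location model with a $N(0, \tau^2)$ prior and $\phi(x) = -\log x + x - 1$; these modelling choices, the step size, the burn-in length and all names are assumptions made for the example.

```python
import numpy as np

def random_walk_metropolis(log_target, init, n_draws, step, rng):
    """Random-walk Metropolis sampler for a scalar parameter."""
    draws = np.empty(n_draws)
    current, current_lp = init, log_target(init)
    for i in range(n_draws):
        proposal = current + step * rng.normal()
        proposal_lp = log_target(proposal)
        # accept with probability min(1, target(proposal) / target(current))
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        draws[i] = current
    return draws

rng = np.random.default_rng(2)
sample = rng.normal(1.0, 1.0, size=50)
n, tau2, theta = sample.size, 4.0, 0.0

def log_phi_posterior(alpha):
    # n * P_n h(theta, alpha) + log pi(alpha), up to an additive constant,
    # for the KL_m choice in the N(alpha, 1) model with a N(0, tau2) prior
    n_criterion = -0.5 * np.sum((sample - alpha) ** 2) + 0.5 * np.sum((sample - theta) ** 2)
    return n_criterion - 0.5 * alpha**2 / tau2

chain = random_walk_metropolis(log_phi_posterior, init=0.0, n_draws=20000, step=0.3, rng=rng)
alpha_mcmc = chain[2000:].mean()                     # average (19) after a burn-in period
alpha_exact = n * sample.mean() / (n + 1.0 / tau2)   # conjugate posterior mean, for comparison
```

In this conjugate setting the chain average can be compared with the closed-form posterior mean; for other divergences only `log_phi_posterior` would change.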

Consider the construction of confidence intervals for the quantity $f(\theta_0)$, for a
given continuously differentiable function $f : \Theta \to \mathbb{R}$. Define

$$c_n(\epsilon) := \inf\left\{ x : \int_{f(\alpha) \le x} p_{\phi,n}(\alpha)\, d\alpha \ge \epsilon \right\}. \tag{20}$$

Then the dual $\phi$-divergence Bayes type estimator confidence interval is given by
$\left[ c_n(\epsilon/2),\ c_n(1 - \epsilon/2) \right]$.

These confidence intervals can be constructed simply by taking the $(\epsilon/2)$th and
$(1 - \epsilon/2)$th quantiles of the MCMC sequence

$$f(S) = \left( f(\alpha^{(1)}), f(\alpha^{(2)}), \ldots, f(\alpha^{(B)}) \right), \tag{21}$$

and thus are quite simple in practice.
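
The quantile construction can be sketched in a few lines. In the example below the draws are simulated from a normal distribution purely as a stand-in for the output of a sampler targeting the $\phi$-posterior; the function name `quantile_interval` and the choice $f(\alpha) = \alpha^2$ are likewise illustrative assumptions.

```python
import numpy as np

def quantile_interval(draws, f, eps):
    """Equal-tailed interval from the eps/2 and 1 - eps/2 quantiles of f(draws)."""
    values = f(np.asarray(draws))
    lower, upper = np.quantile(values, [eps / 2.0, 1.0 - eps / 2.0])
    return float(lower), float(upper)

rng = np.random.default_rng(3)
# Stand-in for an MCMC sequence; in practice these would be sampler output.
draws = rng.normal(loc=1.0, scale=0.1, size=50000)
lower, upper = quantile_interval(draws, lambda a: a**2, eps=0.05)
# Fraction of transformed draws falling inside the interval.
coverage = np.mean((draws**2 >= lower) & (draws**2 <= upper))
```

By construction roughly a fraction $1 - \epsilon$ of the transformed draws falls inside the interval.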

The very particular choice of the escort parameter defined through $\theta = \theta_0$ has
the same limit properties as the posterior mean. This result is of some relevance, since
it leaves open the choice of the divergence while keeping good asymptotic properties. We
expect that it can also be used directly to provide robust inference; we leave this study
for a subsequent paper.

The problem of the choice of the divergence remains an open question and needs more
investigation.



5 Proofs 

Our arguments follow those presented by Lehmann and Casella (1998); the main difference
is due to the non-likelihood setting. See also Chernozhukov and Hong (2003) for
similar arguments. We often use $M$ to denote a generic finite constant and $I$ to denote
the identity matrix. The smallest eigenvalue of a matrix $S$ is denoted by mineig($S$).



5.1 Proof of Theorem 1 

Define

$$t := \sqrt{n}\,(\alpha - A_n), \qquad A_n := \theta_0 + S^{-1} U_n(\theta_0), \tag{22}$$

then



Plnit) = ^p^^^(2= + An] (23) 





+ A„,^ 


1 exp < 








[ 




+ A„^ 


1 exp 1 




I — 




> du 



vr ( + A„ ) exp{uj{t)} 



where 



and 



(24) 



coit) := nFnh (9, — + An]- nP„/i {0, 9o) - -Un{9o) ' S-'Uni9o), 
In I 



Cn := y TT + exp{a;(n)} du. 
Lemma 1 Let



Ji 



(25) 



then if (R.1-4) hold, $J_1 \xrightarrow{P} 0$.
By Lemma 1, we have that



Observe that 



det5\ 



d/2 



2tt 



I exp<|--t'5t 



IdetSr 



J 

dt = — , 



(26) 



where 



J :-- 



dt 



(27) 



By (26), to show (17) it is enough to show that $J \xrightarrow{P} 0$. But $J \le J_1 + J_2$,
where $J_1$ is given by (25) and



Observe that 



J2 



J2 



\detS\ ItTst ^ kT^i 



dt. 



|det5| , 



(28) 



By Lemma 1 and (28), $J_1$ and $J_2$ tend to zero in probability, and this completes the proof.



Proof of Lemma 1 

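Let

$$U_n(\theta_0) := P_n \frac{\partial}{\partial\alpha} h(\theta, \theta_0).$$

Using (R.2) and (R.3) in connection with the central limit theorem (CLT), we can see that

$$\sqrt{n}\, V^{-1/2} U_n(\theta_0) \xrightarrow{d} \mathcal{N}(0, I). \tag{29}$$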
Write



n 



(30) 



To prove that the integral (25) tends to zero in probability, divide the range of
integration into the three parts:

(i) $|t| \le M$,

(ii) $M < |t| < \delta\sqrt{n}$,

(iii) $|t| \ge \delta\sqrt{n}$,

and show that the integral over each of the three tends to zero in probability.
Part(i): 



J\t\<M ) 

To prove this result, we shall show that for every $0 < M < \infty$,



A 0. 



sup 

|t|<M 



n 



0. 



(31) 



Substituting the expression (30) for $\omega(t)$, (31) is seen to follow from the
following two facts



sup 

\t\<M 







and 



sup 

\t\<M 



n 



0. 



(32) 



(33) 



The first fact is obvious from the continuity of $\pi$ and because by Condition (R.3) and (29):



(34) 



so that $A_n \xrightarrow{P} \theta_0$.

Given (34), the second fact follows from Condition (R.2), and



sup 

\t\<M 



A, 



t 



n 



Op{ 



1 



n 



Part(ii): 



M<\t\<5yfTi 



vr 



J 



vr (6*0)6 2 



dt A 0. 
For the second part, since the integral of the second term is finite and can be made
arbitrarily small by setting $M$ large, it suffices to show that the integrand of the
first term is bounded by an integrable function with probability $\ge 1 - \epsilon$. More
precisely, we shall show that given $\epsilon > 0$, there exist $\delta > 0$ and
$C < \infty$ such that for sufficiently large $n$,



P 



t 



n 



+ A„ ) e'^W < Ce-3*^"5t fpj. M <\t\< 



> 1 - e. 



(35) 



By the fact that $A_n \xrightarrow{P} \theta_0$ and the continuity of $\pi$, we can drop the
factor $\pi\!\left(\frac{t}{\sqrt{n}} + A_n\right)$ from consideration, so that it remains
to establish such a bound for $\exp\{\omega(t)\}$. By the definition of $\omega(t)$ in (30),



$$\exp\{\omega(t)\} \le \exp\left\{ -\frac{1}{2} t^\top S t + R_n\!\left(\frac{t}{\sqrt{n}} + A_n\right) \right\}. \tag{36}$$



Since $|A_n - \theta_0| = o_p(1)$, it follows that with probability arbitrarily close to 1,
for $n$ sufficiently large,

$$\left| A_n + \frac{t}{\sqrt{n}} - \theta_0 \right| < 2\delta' \quad \text{for all } |t| \le \delta'\sqrt{n}.$$

Thus, by Condition (R.4), there exist some small $\delta'$ and large $M$ such that the
latter inequality implies



P 









sup 


i?„(^ + A„) 


< 


M<\t\<5'yfn. 







> 1 - e. 



Combining this fact with (34), we see that (36), for some C > 0, is 

exp < Cexpj-V^tj, 



(37) 



for all $t$ satisfying (ii), with probability arbitrarily close to 1, and this establishes (35).
Part (iii):



L 



\t\>5^ 



TT 



t 



dt A 0. 



As in (ii), the second term in the integrand can be neglected. Therefore we only need to 
show 



vr 



n 



0. 



Recalling the definition of t, the term is bounded by 

TT (a) exp [vFnh {0, a) - nF^h (9, Oq) - ^t/„(^o)^5^'f/n(^o)| da. 



'|a-A„|><5 
By (R.4), for any $\delta > 0$, there exists $\varepsilon > 0$ such that

sup {Fnh{e,a)-Fnh{e,eo))<-e. (38) 

Since $A_n \xrightarrow{P} \theta_0$, with probability tending to 1 there exists
$\varepsilon$ such that

$$\sup_{|\alpha - A_n| \ge \delta} \exp\{n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_0)\} \le e^{-n\varepsilon}.$$

Since exp | — ^C/n(^o)^5'~^f^n(^o)| = Op{l), the entire term is bounded by 

CV^e""^y" 7r(a) da = op(l), 
with probability tending to 1. 

The entire proof is now completed by combining all terms. 

5.2 Proof of Theorem 2 

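We have

$$V^{-1/2} S \sqrt{n}\,(\alpha^*_\phi(\theta) - \theta_0) = V^{-1/2} S \sqrt{n}\,(\alpha^*_\phi(\theta) - A_n) + V^{-1/2} S \sqrt{n}\,(A_n - \theta_0).$$

By the CLT, the second term has the limit distribution $\mathcal{N}(0, I)$, so that it only
remains to show that

$$\sqrt{n}\,(\alpha^*_\phi(\theta) - A_n) \xrightarrow{P} 0. \tag{39}$$

Observe that

$$\alpha^*_\phi(\theta) = \int \alpha\, p_{\phi,n}(\alpha)\, d\alpha = \int \left(\frac{t}{\sqrt{n}} + A_n\right) p^*_{\phi,n}(t)\, dt = \frac{1}{\sqrt{n}} \int t\, p^*_{\phi,n}(t)\, dt + A_n,$$

and hence

$$\sqrt{n}\,(\alpha^*_\phi(\theta) - A_n) = \int t\, p^*_{\phi,n}(t)\, dt. \tag{40}$$

Thus,

$$\left| \sqrt{n}\,(\alpha^*_\phi(\theta) - A_n) \right| = \left| \int t\, p^*_{\phi,n}(t)\, dt - \int t\, \frac{|\det S|^{1/2}}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} t^\top S t\right\} dt \right| \le \int |t| \left| p^*_{\phi,n}(t) - \frac{|\det S|^{1/2}}{(2\pi)^{d/2}} \exp\left\{-\frac{1}{2} t^\top S t\right\} \right| dt,$$

which tends to zero in probability by Theorem 1.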